Spike: Pause compaction when out of sync #10392
Conversation
It is not unreasonable, I think, but we should test it on a mainnet node.
I don't see anything massively wrong here; it would definitely be good to test on a mainnet node.
@@ -396,6 +396,9 @@ func (b *Blockstore) doCopy(from, to *badger.DB) error {
 	if workers < 2 {
 		workers = 2
 	}
+	if workers > 8 {
+		workers = 8
+	}
Do we have confidence in this number? Did we check that it's "good enough" on mainnet?
Need to check on this. There's some evidence that it might be too high right now, resulting in the chain getting out of sync. An alternate strategy is to leave it as is and see whether chain sync issues persist, correlated with moving GC.
@vyzo, for reference, do you have an estimate of a reasonable time for moving GC to complete today?
cc @TippyFlitsUK, you might have an even better idea.
Maybe 5-10 min.
Let's take a fresh measurement?
For reference, I got 50m on my fairly well-resourced mainnet node.
^ with the current default of half the CPUs.
I'll measure with 8 and compare when I get to similar garbage levels.
Working without major issues on mainnet (cc @TippyFlitsUK), and this seems to be preventing out-of-sync during moving GC.
One last thing to note: I'm suspicious that reducing goroutines here is the main thing causing the improvement, since moving GC is anecdotally what causes chain sync to get left behind, and we are not doing any yielding during the move (yet) in this change. If we are concerned about the risk of compaction deadlock, we could just include the simple goroutine limiting of moving GC.
// already out of sync, no signaling necessary
}
// TODO: ok to use hysteresis with no transitions between 30s and 1m?
When we're syncing, the head will only be taken when we get fully in sync, so this is probably fine.
(Well, really the new sync head is only selected when we complete the previous sync, but the effect is similar.)
Tests are mad
Conditions always call "unlock", so we can't safely use the condition with both the read and write sides of the lock. So we might as well revert back to a regular lock. Fixes #10616
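For background on the constraint described above, a minimal sketch (the pauser type and its methods are illustrative, not this PR's code): sync.Cond.Wait always calls Unlock and then Lock on the single Locker the Cond was built with, so one Cond cannot safely serve callers holding an RWMutex's read lock alongside callers holding its write lock. A plain Mutex keeps Wait's internal unlock paired with the lock the caller actually holds.

```go
package sketch

import "sync"

// pauser blocks compaction workers while the chain is out of sync.
// A single plain Mutex backs the Cond, so Wait's internal
// Unlock/Lock always operates on the lock the caller actually holds.
type pauser struct {
	mu     sync.Mutex
	cond   *sync.Cond
	paused bool
}

func newPauser() *pauser {
	p := &pauser{}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// waitIfPaused blocks until paused is cleared.
func (p *pauser) waitIfPaused() {
	p.mu.Lock()
	for p.paused {
		p.cond.Wait() // atomically releases p.mu and sleeps; relocks on wake
	}
	p.mu.Unlock()
}

// setPaused updates the flag and wakes any waiting workers on resume.
func (p *pauser) setPaused(v bool) {
	p.mu.Lock()
	p.paused = v
	p.mu.Unlock()
	if !v {
		p.cond.Broadcast()
	}
}
```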
@arajasek I don't think you meant to close this, right?
…re-oos Backport #10392 into v1.21.0
Closed by #10641, backported and merged back to master.
Related Issues
I ended up playing with some of the chain sync contention protection ideas from #10388
Proposed Changes
Use head-change out-of-sync detection to signal compaction to pause. Signal it to continue when the chain is back in sync.
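As a rough sketch of that flow, with hypothetical names (signaller, watchHeads) standing in for the splitstore's actual wiring:

```go
package sketch

import "time"

// signaller watches head changes and tells the compaction loop to
// pause when the chain falls behind and to continue once it catches up.
type signaller struct {
	pause  chan struct{}
	resume chan struct{}
}

func (s *signaller) watchHeads(heads <-chan time.Time, maxLag time.Duration) {
	outOfSync := false
	for headTime := range heads {
		lag := time.Since(headTime)
		switch {
		case lag > maxLag && !outOfSync:
			outOfSync = true
			s.pause <- struct{}{} // signal compaction to pause
		case lag <= maxLag && outOfSync:
			outOfSync = false
			s.resume <- struct{}{} // signal compaction to continue
		}
		// otherwise already in the right state: no signaling necessary
	}
}
```

The blocking sends keep the sketch short; a real implementation would want buffered channels or non-blocking sends so a compaction loop that is mid-batch cannot stall the head-change handler.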
Additional Info
@vyzo, if you can take a look at this for feasibility, it would be a big help.
Checklist
Before you mark the PR ready for review, please make sure that:
- The PR title is in the form <PR type>: <area>: <change being made>
  - example: fix: mempool: Introduce a cache for valid signatures
  - PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
  - area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps